Unsupervised Learning Project

The data contains features extracted from the silhouettes of vehicles at different angles. Four "Corgi" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the two cars.

Domain:

Context:

Attribute Information:

Learning Outcomes:

Objective:

1. Data pre-processing – Perform all the necessary preprocessing on the data so that it is ready to be fed to an unsupervised algorithm

1.1 Import all the necessary libraries

1.2 Read the data

Observations

1.3 Copying data to preserve the original data

1.4 Shape of the data

Observations

1.5 Data type of each attribute

Observations

1.6 Finding unique data

1.7 Five-point summary of numerical attributes

Observation

1.8 Check for Duplicate and Null values

We will replace the missing values with the median values of that particular column.
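As a minimal sketch of median imputation (assuming the data sits in a pandas DataFrame; the column names here are illustrative, not from the vehicle dataset):

```python
import numpy as np
import pandas as pd

# Illustrative frame with missing values (column names are hypothetical)
df = pd.DataFrame({"compactness": [95.0, np.nan, 104.0, 85.0],
                   "circularity": [48.0, 41.0, np.nan, 44.0]})

# Replace missing values in each numeric column with that column's median
df = df.fillna(df.median(numeric_only=True))
```

After this, `df.isnull().sum()` should report zero missing values in every column.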

After imputation

Observations

1.9 Functions for plotting various graphs

1.10 Check for Outliers

Observations - One can see that the following columns are affected by outliers:

For better clarity, we will plot box plots of individual columns.

Observations

Observations

Observations

Outlier Treatment
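One common treatment, sketched on a toy series (the values and the 1.5×IQR fences are illustrative; the notebook's actual treatment may differ), is to cap values outside the interquartile-range fences rather than drop rows:

```python
import pandas as pd

# Toy column with one obvious outlier (values are illustrative)
s = pd.Series([10, 12, 11, 13, 12, 11, 50])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap (winsorize) values outside the IQR fences instead of dropping rows
s_capped = s.clip(lower, upper)
```

Capping preserves the row count, which matters for a small dataset like this one.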

Observations

2. Understanding the attributes - Find relationships between the different attributes (independent variables) and choose carefully which attributes should be part of the analysis, and why

2.1 Univariate EDA

Observations

Observations

2.2 Bivariate EDA

Observations

Observations

Observations: The following pairs of attributes have very high correlation between them (>= 0.95)
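Such pairs can be found programmatically from the correlation matrix. A sketch on synthetic data (the columns `x`, `y`, `z` are stand-ins for the vehicle features):

```python
import numpy as np
import pandas as pd

# Toy data: x and y are (nearly) perfectly correlated, z is independent
rng = np.random.default_rng(0)
x = rng.normal(size=100)
df = pd.DataFrame({"x": x,
                   "y": 2 * x + 0.01 * rng.normal(size=100),
                   "z": rng.normal(size=100)})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = [(a, b) for a in upper.index for b in upper.columns
         if pd.notna(upper.loc[a, b]) and upper.loc[a, b] >= 0.95]
```

Here `pairs` comes out as `[("x", "y")]`, the one highly correlated pair.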

2.3 Multivariate Visualization

Observations

Let's plot a pairplot for the columns that have a relatively strong correlation with the class variable...

Observations

2.4 Undoubtedly, the present dataset is a potential candidate for dimensionality-reduction techniques such as PCA. However, PCA yields a reduced number of uncorrelated features in a transformed domain, i.e., the new features are linear combinations of some of the existing features. Before applying PCA, let us find out which attributes are important for the present analysis, along with their validation.

Let us start with <b>univariate feature selection</b> approaches. We will use:

Then we will also use two more sophisticated approaches:

2.4.1 Separate independent and target attributes

2.4.2 Chi square function
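A sketch of chi-square feature selection with scikit-learn's `SelectKBest` (the iris dataset stands in for the vehicle data here; note that `chi2` requires non-negative features):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Iris stands in for the vehicle data; chi2 needs non-negative features
X, y = load_iris(return_X_y=True)

# Keep the k features whose chi-square statistic vs. the class is highest
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print(selector.scores_)       # per-feature chi-square scores
print(selector.get_support()) # boolean mask of the selected features
```

On iris, the two petal measurements score far higher than the sepal ones and are the features retained.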

Observations

2.4.3 f_classif function

Observations

2.4.4 Feature selection using RFE with Logistic regression model
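Recursive feature elimination (RFE) repeatedly fits the model and drops the weakest feature. A minimal sketch, again using iris as a stand-in for the vehicle data:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)  # logistic regression benefits from scaling

# Recursively drop the lowest-weight feature until 2 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)

print(rfe.support_)  # boolean mask of selected features
print(rfe.ranking_)  # 1 = selected; higher numbers were eliminated earlier
```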

Observations

2.4.5 Feature selection using LASSO with Logistic regression model
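LASSO-style selection uses an L1 penalty, which drives the coefficients of unhelpful features to exactly zero. A sketch with an L1-penalized logistic regression wrapped in `SelectFromModel` (iris again as the stand-in; the choice `C=0.1` is an illustrative regularization strength):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

# L1 penalty zeroes out coefficients of weak features; C=0.1 is illustrative
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(l1).fit(X, y)

selected = selector.get_support()  # True for features with nonzero weight
print(selected)
```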

Observations

From the above analysis it is clear that the features selected by LASSO (or RFE) have to be part of the analysis, since they are marked as important by these methods. Also, for this subset of features we get the minimum number of highly correlated (positive or negative) feature pairs (4 in this case).

Observations

2.4.6 Feature Importance using Random Forest Classifier
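A random forest provides impurity-based importances as a by-product of training. A sketch (iris as the stand-in dataset; `n_estimators` and `random_state` are illustrative choices):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

# Impurity-based importances sum to 1; higher means more informative
importances = rf.feature_importances_
order = np.argsort(importances)[::-1]  # feature indices, most important first
```

On iris, the petal features dominate the ranking, matching what the univariate tests found.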

Observations

2.5 Understanding the target variable

Observations

Observations

3. Split the data into train and test (Suggestion: specify “random state” if you are using train_test_split from Sklearn)

4. Train a Support Vector Machine using the train set and get the accuracy on the test set

5. Perform K-fold cross validation and get the cross validation score of the model
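Steps 3–5 can be sketched together (iris as the stand-in dataset; the 70/30 split, `random_state=42` and the RBF kernel are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

# Scaling inside a pipeline avoids leaking test-set statistics into training
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)
test_acc = model.score(X_test, y_test)

# 5-fold cross-validation on the training set
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
```

Fixing `random_state` makes the split reproducible, which is what lets us later score a PCA-based model on the same test rows.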

6. Use PCA from Scikit learn, extract Principal Components that capture about 95% of the variance in the data

Curse of Dimensionality

The curse of dimensionality is the phenomenon where the feature space becomes increasingly sparse as the number of dimensions grows for a fixed-size training dataset. Analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions) is prone to various adverse outcomes. Most machine learning algorithms are very susceptible to overfitting due to the curse of dimensionality.

To overcome such situations, we perform dimensionality reduction, where algorithms reduce the number of dimensions. PCA is one such feature-extraction technique.

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) uses an orthogonal linear transformation to produce a lower-dimensional representation of the dataset. It finds a sequence of linear combinations of the variables, called principal components, that explain the maximum variance, summarize the most information in the data, and are mutually uncorrelated with each other.

PCA allows us to quantify the trade-off between the number of features we utilize and the total variance explained. It also allows us to determine which features capture similar information and discard them to create a more parsimonious model.

In order to perform PCA we need to do the following:

  1. Standardize the data.
  2. Use the standardized data to create a covariance matrix.
  3. Use the resulting matrix to calculate eigenvectors (principal components) and their corresponding eigenvalues.
  4. Sort the components in descending order by their eigenvalues.
  5. Choose the n components that explain the most variance within the data (a larger eigenvalue means the component explains more variance).
  6. Create a new matrix using the n components.
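The steps above can be sketched with NumPy on toy data (in practice `sklearn.decomposition.PCA` does all of this for us):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # toy data standing in for the vehicle features

# 1. Standardize
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
# 2. Covariance matrix of the standardized data
cov = np.cov(Xs, rowvar=False)
# 3. Eigenvectors (principal components) and eigenvalues
eigvals, eigvecs = np.linalg.eigh(cov)
# 4. Sort components in descending order of eigenvalue
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# 5. Keep enough components to explain ~95% of the variance
explained = np.cumsum(eigvals) / eigvals.sum()
n = int(np.searchsorted(explained, 0.95)) + 1
# 6. Project the data onto the selected components
X_pca = Xs @ eigvecs[:, :n]
```

With uncorrelated toy data the eigenvalues are all similar, so few components can be dropped; on correlated data like ours, far fewer components reach 95%.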

6.1 PCA Steps

Observations

Observations

Applying PCA on 6 Components

7. Repeat steps 3, 4 and 5, but this time use the Principal Components instead of the original data. The accuracy score should be computed on the same rows of test data that were used earlier.

7.1 Splitting the data into training (70%) and testing (30%) sets.

Original Dataset

PCA reduced Dataset

7.2 SVM with PCA

Dataframe showing results of models with and without PCA

Observations

7.3 Use grid search (try C values 0.01, 0.05, 0.5 and 1, and kernel = linear, rbf) to find the best hyperparameters, and perform cross-validation to find the accuracy.
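A sketch of this search with `GridSearchCV` over the stated grid (iris as the stand-in dataset; the pipeline step name `svc` and the split parameters are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
# The C values and kernels from the assignment; "svc__" prefixes route
# each parameter to the SVC step of the pipeline
param_grid = {"svc__C": [0.01, 0.05, 0.5, 1],
              "svc__kernel": ["linear", "rbf"]}

# 5-fold cross-validation over every C/kernel combination
grid = GridSearchCV(pipe, param_grid, cv=5).fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```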

7.3.1 Grid Search with PCA dataset

Observations

7.3.2 Grid Search without PCA (with original dataset)

Observations

Dataframe of Grid Search suggested models with and without PCA

7.4 Modelling

8. Compare the accuracy scores and cross validation scores of Support vector machines – one trained using raw data and the other using Principal Components, and mention your findings

8.1 Plot training vs cross validation scores

8.2 Consolidated dataframe of all models

Observations

Conclusion

We used the correlation matrix and checked the relation of each feature with the class column to reduce the number of features in the dataset from 18 to 12.

PCA, being a statistical technique that reduces the dimensionality of the data by constructing components that capture the maximum information about the dataset, does the task here. We reduced the dimensionality from 12 to 6 by keeping the components that together explain 95% of the variance. In doing so it also removes correlated features, which we saw in the scatterplots before and after PCA.

However, some limitations are clearly seen in this use case. After applying PCA to the dataset, the features are converted into principal components, which are linear combinations of the original features; this makes them less interpretable. Additionally, one known limitation of PCA is that it assumes linearity, i.e., that principal components are linear combinations of the original features, which, if not true, will not give sensible results.

We then applied Naive Bayes and a Support Vector Classifier on the reduced features (dimensions) and got accuracies of 67.5% and 78.3% respectively, with precision (macro) scores of 64% and 76% and recall (macro) scores of 65% and 77%. We also applied SVC on the 12 actual features (which retain interpretability) and saw an accuracy of 92.9%, precision (macro) of 92% and recall (macro) of 93%, a much better score than SVC applied on the principal components.

The dataset we were dealing with had 846 rows and 12 features + 1 class column. The effect of PCA can be more pronounced on large datasets with more features.

Based on the learning curves, we can conclude that for Naive Bayes with principal components both training and validation scores are volatile; however, the validation score almost flattens beyond a training size of ~330. For SVC, with principal components and with the original features alike, both training and validation scores increase with the size of the dataset, which means the scores could be improved further with more training samples. However, the gap between training and validation scores for SVC with principal components is larger than for the others.